<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
   <TITLE>Coco/R for C#</TITLE>
   <META NAME="GENERATOR" CONTENT="Mozilla/3.0Gold (Win16; I) [Netscape]">
</HEAD>
<BODY>

<H1 ALIGN=CENTER>Coco/R for C#</H1>

<CENTER><P>H. M&ouml;ssenb&ouml;ck<BR>
Johannes Kepler University Linz<BR>
Institute of Practical Computer Science </P>

<P>This version amended (17 September 2002) and extended by<BR>
<A HREF="mailto:p.terry@ru.ac.za">P.D. Terry</A><BR>
Computer Science Department<BR>
Rhodes University, Grahamstown</P>
</CENTER>

<P><I>You may obtain the latest version of this document from
<A HREF="http://cs.ru.ac.za/homes/cspt/cshcoco.htm">
http://cs.ru.ac.za/homes/cspt/cshcoco.htm</A></I>

<P>Coco/R is a compiler generator which takes a compiler description in the
form of an LL(1) attributed grammar and generates a scanner and parser for the
described language. This paper describes a version of Coco/R for C# that
slightly extends the original, and is a modified version of the original
documentation by M&ouml;ssenb&ouml;ck that you can read
<A HREF="http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/">here</A>.
There are also implementations of Coco/R for
<A HREF="http://cs.ru.ac.za/homes/cspt/cocor.htm">Java, C, C#, Pascal, Delphi and
Modula-2</A> and for
<A HREF="ftp://ftp.ssw.uni-linz.ac.at/pub/Oberon/LinzTools/Coco.Cod">
Oberon</A>. </P>

<H2>Contents</H2>

<OL>
<LI><A HREF="#intro">Introduction</A> </LI>
<P>
<LI><A HREF="#InpLang">The specification language Cocol/R</BR>
</A>
2.1 Vocabulary<BR>
2.2 Structure of a compiler description<BR>
2.3 Scanner specification<BR>
2.4 Parser specification </LI>
<P>

<LI><A HREF="#UserGuide">User Guide<BR>
</A>3.1 The Parser interface<BR>
3.2 The Token interface<BR>
3.3 The Scanner interface<BR>
3.4 The Errors interface<BR>
3.5 The driver program<BR>
3.6 Grammar tests<BR>
3.7 Trace output </LI>
<P>

<LI><A HREF="#hints">Hints for Advanced Users of Coco/R</A> </LI>
<P>

<LI><A HREF="#Taste">A Sample Compiler</A> </LI>
<P>

<LI><A HREF="#Apps">Useful applications of Coco/R</A></LI>
<P>

<LI><A HREF="#Sources">Sources of Coco/R</A> </LI>
<P>

<LI><A HREF="#port">Portability issues</A> </LI>
<P>

<LI><A HREF="#Lit">Literature</A> </LI>
</OL>

<P><A NAME="intro"></A></P>

<H2>1. Introduction</H2>

<P>Coco/R takes an LL(1) attributed grammar for a language, expressed in
augmented EBNF, and generates C# code for a recursive descent parser and a
scanner for that language. To generate a complete application, such as a
compiler, a user has also to supply a simple main method that calls the
parser, as well as semantic classes that are used from within the grammar
(e.g., a symbol table handler and a code generator). </P>

<P><IMG SRC="Fig1.gif" NATURALSIZEFLAG="3" HEIGHT=100 WIDTH=318 ALIGN=BOTTOM><BR>
</P>

<P>The following example gives you a rough impression of the input language
(Cocol/R). It shows a grammar rule for the processing of declarations. </P>

<pre>  VarDeclaration&lt;ref int adr&gt;
                         (. Obj obj, obj1;
                            Struct typ;
                            int n, a;
                            string name; .)
  = Ident&lt;out name&gt;      (. obj = SymTab.Find(name);
                            obj.next = null;
                            n = 1; .)
  { ',' Ident&lt;out name&gt;  (. obj1 = SymTab.Find(name);
                            obj1.next = obj;
                            obj = obj1;
                            n++; .)
  }
  ':'
  Type&lt;out typ&gt;          (. adr += n * typ.size;
                            a = adr;
                            while (obj != null) {
                              a -= typ.size;
                              obj.adr = a;
                              obj = obj.next;
                            } .)
    ';'.
</pre>

The core of this specification is the <b>EBNF rule</b>
<pre>  VarDeclaration = Ident {',' Ident} ':' Type ';'.
</pre>

It is augmented by attributes and semantic actions. The <b>attributes</b> (e.g.
&lt;out name&gt;) specify the parameters of the symbols. There are input
attributes (e.g. &lt;x, y&gt;) and output attributes (e.g. &lt;out z&gt;)
that are written like parameters in C#.
The <b>semantic actions</b> are arbitrary C# statements between "(." and ".)".
Such semantic actions will be executed by the generated parser at the position
in the grammar at which the associated syntactic element is encountered.

<p>
Coco/R generates a parser method for every grammar rule. The parser method
of the above rule looks as follows:

<pre>  static void VarDeclaration (ref int adr) {
    Obj obj, obj1;
    Struct typ;
    int n, a;
    string name;
    name = Ident();
    obj = SymTab.Find(name);
    obj.next = null;
    n = 1;
    while (t.kind == ',') {
      Get();
      name = Ident();
      obj1 = SymTab.Find(name);
      obj1.next = obj;
      obj = obj1;
      n++;
    }
    Expect(':');
    typ = Type();
    adr += n * typ.size;
    a = adr;
    while (obj != null) {
      a -= typ.size;
      obj.adr = a;
      obj = obj.next;
    }
    Expect(';');
  }
</pre>


<P>In this code can also be seen the statements that interact with the
generated scanner, whose responsibility it will be to read the input file and
return a stream of tokens to the parser.</P>

<A NAME="InpLang"></A>

<H2>2. The specification language Cocol/R</H2>

<P>A compiler description can be viewed as consisting of imports,
declarations and grammar rules that describe the lexical and syntactical
structure of a source language as well as its translation into a target
language.

<H3>2.1 Vocabulary</H3>

<P>The vocabulary of this version of Cocol/R uses identifiers, strings,
characters and numbers:</P>

<PRE>
  CHARACTERS
    letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit    = "0123456789" .
    cntl     = CHR(0) .. CHR(31) .
    tab      = '\t' .
    lf       = '\n' .
    cr       = '\r' .
    back     = CHR(92) .
    noQuote1 = ANY - '"' - cntl - back .
    noQuote2 = ANY - "'" - cntl - back .
    graphic  = ANY - cntl .
  TOKENS
    ident  = letter {letter | digit} .
    string =   '&quot;' {noQuote1 | back graphic } '&quot;'
             | &quot;'&quot; {noQuote2 | back graphic } &quot;'&quot; .
    number = digit {digit} .
</PRE>

<P>Upper case letters are distinct from lower case letters. Strings must
not cross line borders.

<p>Within strings C# escape sequences are interpreted as such, thus for
example  '\t' representss CHR(9), and '\u0041' represents 'A'. This represents
a departure from some other versions of Coco/R and has been implemented
slightly differently in different releases.


<P>Keywords are </P>

<PRE>
   ANY         CASE        CHARACTERS  CHR         COMMENTS
   COMPILER    CONTEXT     END         EOF         FROM
   IGNORE      NAMES       NESTED      PRAGMAS     PRODUCTIONS
   SYNC        TO          TOKENS      WEAK        using
</PRE>

<P>The following metacharacters are used to form EBNF expressions: </P>

<table cellpadding="4">
<tr>
<td valign=top>
(   )<BR>
{   }<BR>
[   ]<BR>
&lt; &gt;<BR>
(. .)<BR>
= . | + -
</td>
<td valign=top>
for grouping<BR>
for iterations<BR>
for options<BR>
for attributes<BR>
for semantic parts<BR>
as explained below
</td>
<td valign=top>
(a | b) c ... a c | b c<br>
{a} b ... b | a b | a a b | a a a b | ...<br>
[a] b ... b | a b
</td></tr>
</table>

<P>Comments are enclosed in &quot;/*&quot; and &quot;*/&quot; and may be
nested. </P>


<H3>2.2 Structure of a compiler description</H3>

<P>A compiler description in Cocol/R is made up of the following parts:</P>

<PRE>
   Cocol =
     &quot;COMPILER&quot; GoalIdentifier
     ArbitraryCSharpDeclarations
     ScannerSpecification
     ParserSpecification
     &quot;END&quot; GoalIdentifier &quot;.&quot; .
   GoalIdentifier = ident .
</PRE>

<P>The name after the keyword COMPILER is the <I>grammar name</I> and
must match the name after the keyword END. The grammar name also denotes the
topmost nonterminal (the <I>start symbol</I>). </P>

<P>Arbitrary C# code may follow the grammar name. This is not checked by
Coco/R.  It usually contain declarations of fields and methods to be used in
the semantic actions of the parser class that is to be generated, for example.
</P>

<pre>  static int sum;

  static void Add(int x) {
    sum = sum + x;
  }
</pre>

In front of the keyword COMPILER one can import namespaces such as:

<pre>  using System;
  using System.Collections;
</pre>


<P>The remaining parts of the compiler description specify the lexical
and syntactical structure of the language to be processed. Effectively two
grammars are specified - one for the lexical analyzer or scanner, and the
other for the syntax analyzer or parser.  The nonterminals (token classes)
recognized by the scanner are regarded as terminals by the parser.</P>

<H3>2.3 Scanner specification</H3>

<P>A scanner has to read source text, skip meaningless characters, and
recognize tokens which are then passed to a parser. Tokens may be classified
either as <I>literals</I> or as <I>token classes</I>. Literals (like &quot;while&quot; and
&quot;&gt;=&quot;) may be incorporated directly into productions as
strings, and do not have to be named. Token classes (such as identifiers or
numbers) must be named, and have structures that are specified by regular
expressions, defined in EBNF. There are usually many different instances of a
token class (e.g., many different identifiers) which are all recognized as the
same kind of token. </P>

<P>A scanner specification consists of six optional parts that may be written
in arbitrary order. </P>

<PRE>   ScannerSpecification =
     { CharacterSets
     | Tokens
     | Ignorable
     | Comments
     | Pragmas
     | UserNames
     } .
</PRE>

<P><B>Character sets</B>. The <I>CharacterSets</I> component allows for the
declaration of names for character sets such as letters or digits, and defines
the characters that may occur as member of these sets. These names may be used
in the other sections of the scanner specification (but not in the parser
specification). </P>


<PRE>
   CharacterSets = &quot;CHARACTERS&quot; {SetDecl} .
   SetDecl       = SetIdent &quot;=&quot; CharacterSet .
   CharacterSet  = SimpleSet { (&quot;+&quot; | &quot;-&quot;) SimpleSet } .
   SimpleSet     = SetIdent | string | &quot;ANY&quot; | SingleChar [&quot;..&quot; SingleChar ] .
   SingleChar    = &quot;CHR&quot; &quot;(&quot; number &quot;)&quot; | character .
   SetIdent      = ident .
</PRE>

<P>Simple character sets are denoted by one of: </P>
<TABLE>
<TR>
<TD>string
<td> a set consisting of all characters in the string

<tr>
<td>SetIdent
<td> the previously declared character set with this name

<tr>
<td valign=top>CHR(i)
<td valign=top>a character set consisting of a single element with the ordinal value i
(i &lt;=255)

<tr>
<td valign=top>'x'
<td valign=top>a character set consisting of the single character &quot;x&quot;

<tr>
<td valign=top>CHR(i) .. CHR(j)
<td valign=top>a character set consisting of all characters whose ordinal values are in the
range i ... j

<tr>
<td valign=top>'x' .. 'y'
<td valign=top>a character set consisting of all characters in the range 'x' .. 'y'

<tr>
<td>ANY
<td> the set of all acceptable (8 bit ASCII) characters (CHR(1) .. CHR(255)).

</table>


<P>Simple sets may then be combined by the union (+) and difference (-)
operators.</P>

<P><I>Examples</I> </P>

<PRE>
  CHARACTERS
    digit = &quot;0123456789&quot; .            /* the set of all digits */
    hexdigit = digit + &quot;ABCDEF&quot; .     /* the set of all hexadecimal digits */
    eol = '\r' .                      /* the end-of-line character */
    noDigit = ANY - digit .           /* any character that is not a digit */
    ctrlChars = '\u0001' .. CHR(31) . /* ascii control characters */
</PRE>

<P><B>Tokens</B>. A <i>token</i> is a terminal symbol for the parser but a
syntactically structured symbol for the scanner. The token structure has to be
described by a regular expression in EBNF:</P>

<PRE>   Tokens      = &quot;TOKENS&quot; { Token } .
   Token       = TokenSymbol [ &quot;=&quot; TokenExpr &quot;.&quot; ] .
   TokenExpr   = TokenTerm { &quot;|&quot; TokenTerm } .
   TokenTerm   = TokenFactor {TokenFactor} [ &quot;CONTEXT&quot; &quot;(&quot; TokenExpr &quot;)&quot; ] .
   TokenFactor =   SetIdent | string
                 | &quot;(&quot; TokenExpr &quot;)&quot; |  &quot;[&quot; TokenExpr &quot;]&quot; | &quot;{&quot; TokenExpr &quot;}&quot; .
   TokenSymbol = TokenIdent | string .
   TokenIdent  = ident .
</PRE>

<P>Tokens may be declared in any order.  A token declaration defines a
<I>TokenSymbol</I> together with its structure. Usually the symbol on the left-hand
side of the declaration is an identifier, which is then used in other parts of
the grammar to denote the structure described on the right-hand side of the
declaration by a regular expression (expressed in EBNF).  This
<I>TokenExpr</I> may contain literals denoting themselves (for example
&quot;<TT>END</TT>&quot;), or the names of character sets (for example <tt>digit</tt>),
denoting an arbitrary character from such sets.  The restriction to regular
expressions means that it may not contain the names of any other tokens.</P>

<P>While token specification is usually straightforward, there are a number of
subtleties that may need emphasizing.  For example, since spaces are deemed to
be irrelevant when they come between tokens in the input for most languages,
one should not attempt to declare literal tokens that have spaces within
them.</P>

<P>The grammar for tokens allows for empty right-hand sides.  This may seem
strange, especially as no scanner is generated if the right-hand side of a
declaration is missing.  This facility is used if the user wishes to supply a
hand-crafted scanner, rather than the one generated by Coco/R (see Section 4).
In this case, the symbol on the left-hand side of a token declaration may
also simply be specified by a string, with no right-hand side. Tokens
specified without right-hand sides are numbered consecutively starting from 0,
and the hand-crafted scanner has to return token codes according to this
numbering scheme.</P>

<P>There is one predeclared token EOF that can be used in productions where
it is necessary to check explicitly that the end of the source has been
reached.  When the Scanner detects that the end of the source has been
reached further attempts to obtain a token return only this one.</P>

<P>The CONTEXT phrase in a <I>TokenTerm</I> means that the term is only
recognized when its right hand context in the input stream (i.e. the
characters following in the input stream) matches the <I>TokenExpr</I>
specified in brackets. Note that the context phrase is not part of the
token.</P>

<P><I>Examples</I> </P>

<PRE>   TOKENS
     ident  = letter {letter | digit} .
     real   = digit {digit} &quot;.&quot; {digit} [&quot;E&quot; [&quot;+&quot;|&quot;-&quot;] digit {digit}] .
     number = digit {digit} | digit {digit}&nbsp;CONTEXT (&quot;..&quot;) .
</PRE>

<P>The CONTEXT phrase in the above example allows the scanner to distinguish
between reals (e.g., 1.23) and range constructs (e.g., 1..2) that could
otherwise not be scanned with a single character lookahead. After reading
a &quot;1&quot; and a &quot;.&quot;, the scanner still works on both alternatives.
If the next character is again a &quot;.&quot; the &quot;..&quot; phrase
is pushed back to the input stream and a number is returned to the parser.
If the next character is not a &quot;.&quot;, the scanner continues with
the recognition of a real number. </P>

<P><B>Comments and ignorable characters.</B>
Usually spaces within the source text of a program are irrelevant, and in
scanning for the start of a token, a Coco/R generated scanner will simply
ignore them.  Other separators like tabs, line ends, and form feeds may also
be declared irrelevant, and some applications may prefer to ignore the
distinction between upper and lower case input.</P>

<P>Comments are difficult to specify with the regular expressions used to
denote tokens - indeed, nested comments may not be specified at all in this
way. Since comments are usually discarded by a parsing process, and may
typically appear in arbitrary places in source code, it makes sense to have a
special construct to express their structure.</P>

<P>Ignorable aspects of the scanning process are defined in Cocol by</P>

<PRE>   Comments  = &quot;COMMENTS&quot; &quot;FROM&quot; TokenExpr &quot;TO&quot; TokenExpr [ &quot;NESTED&quot; ] .
   Ignorable = &quot;IGNORE&quot; ( &quot;CASE&quot; | CharacterSet ) .
</PRE>

<P>where the optional keyword NESTED should have an obvious meaning.  A
practical restriction is that comment brackets must not be longer than 2
characters.  It is possible to declare several kinds of comments within a
single grammar, for example:</P>

<PRE>   COMMENTS FROM &quot;/*&quot; TO &quot;*/&quot; NESTED
   COMMENTS FROM &quot;//&quot; TO eol
   IGNORE '\r' + '\t' + '\n'
</PRE>

<P>The set of ignorable characters in this example is that which includes the
standard white space separators in ASCII files.  The null character CHR(0)
should not be included in any ignorable set.  It is used internally by Coco/R
to mark the end of the input file.</P>

<P><B>Pragmas</B>. Pragmas, like comments, are tokens that may occur anywhere
in the input stream, but, unlike comments, cannot be ignored. Pragmas
are often used to allow programmers to select compiler switches dynamically.
Since it becomes impractical to modify the phrase structure grammar to handle
this, a special mechanism is provided for the recognition and treatment of
pragmas. In Cocol they are declared like tokens, but may have an associated
semantic action that is executed whenever they are recognized by the
scanner.</P>

<PRE>   Pragmas     = &quot;PRAGMAS&quot; { Pragma } .
   Pragma      = Token [ SemAction ] .
   SemAction   = &quot;(.&quot; ArbitraryCSharpCode &quot;.)&quot; .
</PRE>

<P><I>Example</I> </P>

<PRE>   PRAGMAS
     option = &quot;$&quot; {letter} .  (. i := 1;
                                 while (i &lt; t.val.length()) {
                                   if (t.val.charAt(i) == 'A') ...;
                                   else if (t.val.charAt(i) == 'B') ...;
                                   i++;
                                 } .)
</PRE>

<P>Note: The next token to be delivered to the parser is available in the
field <I>t</I> of the generated parser (see also Section 3.1). It holds the
token code (<I>t.kind</I>), the token text (lexeme) (<I>t.val</I>) and the
token position (<I>t.pos</I>, <I>t.line</I>, <I>t.col</I>). Thus, <I>t.val</I>
is the string of the recognized pragma. </P>

<P><B>User names.</B>
Normally the generated scanner and parser use integer literals to denote the
symbols and tokens.  This makes for unreadable parsers, in some estimations.
If a NAMES part is added to the scanner specification, or the
pragma $N is used, or the command line directive -N, Coco/R will
generate code that uses names for the symbols.  By default these names have a
rather stereotyped form, but preferred user-defined names may be specified - this
may sometimes be needed to help resolve name clashes (for example, between the
default names that would be chosen for "." and "point").</P>

<PRE>   UserNames  = &quot;NAMES&quot; { UserName } .
   UserName   = TokenIdent  &quot;=&quot; ( identifier | string ) &quot;.&quot; .
</PRE>

<P><I>Example</I> </P>

<PRE>   NAMES
     period   = &quot;.&quot; .
     ellipsis = &quot;...&quot; .
</PRE>

<P>The ability to name terminals is an extension over the original
implementation.</P>

<H3>2.4 Parser specification</H3>

<P>The parser specification is the main part of the compiler description. It
contains the productions of an attributed grammar specifying the syntax of the
language to be recognized, as well as the action to be taken as each phrase or
token is recognized. The form of the parser specification may itself be
described in EBNF as follows:</P>

<PRE>   ParserSpecification = &quot;PRODUCTIONS&quot; { Production } .
   Production          = NonTerminal [ FormalAttributes ] [ LocalDeclarations ]
                         '=' Expression &quot;.&quot; .
   FormalAttributes    =   '&lt;' ArbitraryCSharpParameterDeclarations '&gt;'
                         | '&lt;.' ArbitraryCSharpParameterDeclarations '.&gt;' .
   LocalDeclarations   = &quot;(.&quot; ArbitraryCSharpDeclarations &quot;.)&quot; .
   NonTerminal         = ident .
</PRE>

Any name appearing in a <I>Production</I> that has not earlier been declared as a
terminal token is considered to be the name of a <I>NonTerminal</I>. There must be
exactly one production for every <I>NonTerminal</I> (this production may, of course,
specify a list of alternative right-hand sides).</P>

<P><B>Productions</B>. A production is considered as a "procedure" that parses
a nonterminal. It has its own scope for attributes and local names, and is
made up of a left-hand side and a right-hand side which are separated by an
equal sign. The left-hand side specifies the name of the nonterminal, together
with its formal attributes and local declarations. The right-hand side
consists of an EBNF expression that specifies the structure of the nonterminal
as well as its translation. </P>

<P>As in the case of tokens, some subtleties in the specification of productions
should be emphasized.  Firstly, the productions may be given in any order, but
a production must be given for a <I>GoalIdentifier</I> that matches the name used
for the grammar.</P>

<P>The <I>formal attributes</I> enclosed in angle brackets "<" and ">"
or "<." and ".>" essentially consist of parameter declarations in C#.
For example, the declaration

<pre>   SomeSymbol &lt;out int n, string name&gt; = ... .
</pre>
is translated into

<pre>   static void SomeSymbol (out int n, string name) {
     ...
   }
</pre>

<P>The <I>local declarations</I> are arbitrary C# declarations enclosed
in &quot;(.&quot; and &quot;.)&quot;. A production constitutes a scope
for its formal attributes and its locally declared names. Terminals and
nonterminals, globally declared fields and methods, as well as public
classes, are visible in any production. </P>

<P>The syntax of attributes and local declarations is not checked by Coco/R;
this is left to the responsibility of the C# compiler that will actually
compile the generated application.</P>

<P>The goal symbol may not have any <I>FormalAttributes</I>.  Any information
that the parser is required to pass back to the calling driver program must be
handled in other ways.  At times this may prove slightly awkward.</P>

<P>It may happen that an identifier chosen as the name of a <I>NonTerminal</I>
may clash with one of the internal names used in the rest of the system.  Such
clashes will only become apparent when the application is compiled and linked,
and may require the user to redefine the grammar to use other identifiers.</P>

<P><B>Expressions</B>. An EBNF expression defines the context-free structure
of some part of the source language, together with the attributes and semantic
actions that specify how the parser must react to the recognition of each
component.</P>

<PRE>   Expression = Term { '|' Term } .
   Term       = Factor { Factor } .
   Factor     =   [ &quot;WEAK&quot; ] TokenSymbol
                | NonTerminal [ Attributes ]
                | SemAction
                | &quot;ANY&quot;
                | &quot;SYNC&quot;
                | '(' Expression ')'
                | '[' Expression ']'
                | '{' Expression '}' .
   Attributes =   '&lt;' ArbitraryCSharpParameters '&gt;'
                | '&lt;.' ArbitraryCSharpParameters '.&gt;'.
   SemAction  = &quot;(.&quot; ArbitraryCSharpStatements &quot;.)&quot; .
   Symbol     = ident | string .
</PRE>

<P>The <I>Attributes</I> enclosed in angle brackets that may follow a
<I>NonTerminal</I> in the context of a <I>Factor</I> effectively denote
the actual parameters that will be used in calling the corresponding routine.
If a <I>NonTerminal</I> has been defined on the left-hand side of a
<I>Production</I> to have <I>FormalAttributes</I>, then every occurrence of that
<I>NonTerminal</I> in the right-hand side of a <I>Production</I> must have a
list of actual attributes that correspond to the <I>FormalAttributes</I>
according to the parameter compatibility rules of C#. However, the
conformance is only checked when the generated parser class is compiled.</P>

<P>If the attributes contain the symbol "&gt;" they must be enclosed in
brackets of the form "&lt;." and ".&gt;", for example

<PRE>  &lt;. a>b .&gt;
</PRE>

<P>A <I>semantic action</I> is an arbitrary sequence of C# statements
enclosed in &quot;(.&quot; and &quot;.)&quot;. These are simply incorporated
into the generated parser; their syntax is only checked when the generated
parser is compiled.  The digraph &quot;(.&quot; is not allowed within this
text, nor are incomplete strings, as these are symptomatic of mismatched
text.</P>

<P>The symbol ANY denotes any terminal that cannot follow ANY in that context.
It can conveniently be used to parse structures that contain arbitrary
text. For example, the translation of a Cocol/R attribute list incorporates
actions that look as follows: </P>

<PRE>   Attributes &lt;out int len&gt;
             (. int pos; .)
   = '&lt;'     (. pos := token.pos + 1; .)
     {ANY}
     '&gt;'     (. len := token.pos - pos; .) .
</PRE>

<P>In this example the closing angle bracket is an implicit alternative
of the ANY symbol in curly brackets. The meaning is that ANY matches any
terminal except '&gt;'. <I>token.pos</I> is the source text position of
the most recently recognized token (<I>token</I> is a field of the generated
parser; see Section 31). </P>

<P>Note that it is possible to write a production in which an action appears
to be associated with an alternative for an expression that contains no
terminals or nonterminals. This feature is often useful. For example, we
might have</P>

<PRE>   Option =  "push" (. stack[++top] = item; .)
           | "pop"  (. item = stack[top--]; .)
           |        (. MonitorStack(); .) .
</PRE>

<P><B>Syntax error handling</B>. The programmer has to give some hints
in order to allow Coco/R to generate good and efficient error handling.
Firstly, synchronization points have to be specified. A synchronization point
is a location in the grammar where particularly &quot;safe&quot; symbols are
expected in the sense that they are hardly ever missing or mistyped, nor
do they appear very often in the grammar.  </P>

<P>In most languages, good candidates for synchronization points are the
beginning of a statement (where <I>if</I>, <I>while</I>, etc. are expected),
or the beginning of a declaration sequence (where <I>static</I>,
<I>public</I>, etc. are expected). A synchronization point is specified by the
symbol SYNC. When the generated parser reaches such a point, it skips all
input until a symbol occurs that is expected at that point. The end-of-file
symbol is always included amongst the synchronization symbols. This guarantees
that synchronization terminates at least at the end of the source text. </P>

<P>The union of all synchronization sets we shall denote by <I>AllSyncs</I>.
Error-handling can be further improved by specifying which terminals are
weak in a certain context. A &quot;weak terminal&quot; is a symbol that
is often mistyped or missing, such as the semicolon between statements,
and is denoted by preceding it in the grammar with the keyword WEAK. When
the parser expects a weak symbol, but does not find it in the input stream,
it adjusts the input to the next symbol that is either a legal successor
of the weak symbol or a member of <I>AllSyncs </I>(symbols expected at
synchronization points are considered to be particularly &quot;strong&quot;,
so that it makes sense that they are never skipped). </P>

<P><I>Examples</I> </P>

<PRE>   StatementSeq = Statement {WEAK &quot;;&quot; Statement} .
   Declaration = SYNC (&quot;CONST&quot; | &quot;TYPE&quot; | &quot;VAR&quot; | ...) .
</PRE>

<P>As in the first of the above examples, iterations frequently start with
a weak terminal, in situations that can be generally described by</P>

<PRE>   Sequence = FirstPart { WEAK ExpectedTerminal IteratedPart } LastPart .</PRE>

<P>Such weak separators are handled in a special way: if <I>ExpectedTerminal
</I>is not recognized, source tokens are consumed until a token is found
that is either a member of FIRST(<I>IteratedPart</I>) or of FIRST(<I>LastPart</I>)
or of <I>AllSyncs</I>.</P>

<P><B>LL(1) requirements</B>. Recursive descent parsing requires that the
grammar of the parsed language satisfies the LL(1) property. This means
that at any point in the grammar the parser must be able to decide on the
basis of a single lookahead symbol which of several possible alternatives
have to be selected. For example, the following production is not LL(1):</P>

<PRE>   Statement =   ident ':=' Expression
               | ident ['(' ExpressionList ')'] .
</PRE>

<P>Both alternatives start with the symbol <I>ident</I>. When the parser comes
to a statement and <I>ident</I> is the next input symbol, it cannot distinguish
between these alternatives. However, the production can easily be transformed
into </P>

<PRE>   Statement = ident ( &quot;:=&quot; Expression |  ['(' ExpressionList ')'] ) .
</PRE>

<P>where all alternatives start with distinct symbols. There are LL(1)
conflicts that are not as easy to detect as in the above example. It can be
hard for programmers to detect these without a tool like Coco/R that checks the
grammar automatically; Coco/R gives appropriate error messages that suggest
how to correct any LL(1) conflicts.</P>

<A NAME="UserGuide"></A>

<H2>3. User Guide</H2>

<P>In order to generate an application (like a compiler) using Coco/R, a user has to </P>

<UL>
<LI>write an attributed grammar in Cocol/R; </LI>

<LI>process the grammar with Coco/R to generate a scanner and a parser;
</LI>

<LI>develop any supporting classes (like code generators or table handlers)
that are used in the grammar; </LI>

<LI>write a driver (main method) that initializes the scanner and calls the parser. </LI>
</UL>

<P>Coco/R processes an attributed grammar to generate the following files: </P>

<UL>
<LI><I>Scanner.cs</I>: containing the classes <I>Scanner</I>, <I>Buffer</I>
and <I>Token</I>.
</LI>

<LI><I>Parser.cs</I>: containing the class <I>Parser</I> (and, optionally a
class <I>SYM</I> if the NAMES feature is used in the scanner specification, $N
is used as a pragma, or -N is specified on the command line). </LI>

<LI><I>Grammar.cs</I>: containing a driver class (if the optional -C command
line directive is used).

<LI><I>listing.txt</I>: containing trace output (if any). </LI>

</UL>

<P>All generated classes belong to a namespace, whose name is the name
of the start symbol of the attributed grammar.

These classes are generated by inserting appropriate C# source into
so-called <I>frame files</I>: </P>

<UL>
<LI><I>Scanner.frame</I> </LI>

<LI><I>Parser.frame</I> </LI>

<LI><I>Driver.frame</I> or <i>GrammarName.frame</I> (only needed if the $C
pragma or -C command line parameter is used) </LI> </UL>

<P>It is suggested that the frame and support files be copied to the same
directory as the attributed grammar, and then the output files will be
generated in that same directory.  Alternatively, set the environment variable
CRFRAMES to point to a directory where Parser.frame, Scanner.frame and
Driver.frame are to be found, for example

<PRE>
        SET CRFRAMES=C:\COCO\FRAMES
</PRE>

<P>This version of Coco/R is run in "command-line" mode, with a parameter
specifying the name of the attribute grammar to be processed, for example:</P>

<PRE>
   Coco MyGrammar.atg
</PRE>

<P>Optional extra command line parameters may be added in a familiar way to
control the generation and trace output, for example

<PRE>
   Coco -FX MyGrammar.atg
</PRE>

Of course, the executable COCO.EXE must be found along the system PATH.
Permissible command line parameters include

<P>
<TABLE>
<TR>
<TD><TT>&nbsp;&nbsp-A&nbsp;&nbsp</TT> </TD><TD>trace automaton          </TD>
<TD><TT>&nbsp;&nbsp-C&nbsp;&nbsp</TT> </TD><TD>generate compiler driver </TD>
</TR><TR>
<TD><TT>&nbsp;&nbsp-F&nbsp;&nbsp</TT> </TD><TD>list first/follow sets   </TD>
<TD><TT>&nbsp;&nbsp-G&nbsp;&nbsp</TT> </TD><TD>print syntax graph       </TD>
</TR><TR>
<TD><TT>&nbsp;&nbsp-I&nbsp;&nbsp</TT> </TD><TD>trace first sets         </TD>
<TD><TT>&nbsp;&nbsp-J&nbsp;&nbsp</TT> </TD><TD>display <tt>ANY</tt> and <tt>SYNC</tt> sets
</TR><TR>
<TD><TT>&nbsp;&nbsp-M&nbsp;&nbsp</TT> </TD><TD>merge error messages with trace output </TD>
<TD><TT>&nbsp;&nbsp-N&nbsp;&nbsp</TT> </TD><TD>generate symbol names    </TD>
</TR><TR>
<TD><TT>&nbsp;&nbsp-P&nbsp;&nbsp</TT> </TD><TD>display statistics       </TD>
<TD><TT>&nbsp;&nbsp-S&nbsp;&nbsp</TT> </TD><TD>list symbol table        </TD>
</TR><TR>
<TD><TT>&nbsp;&nbsp-T&nbsp;&nbsp</TT> </TD><TD>test grammar only      </TD>
<TD><TT>&nbsp;&nbsp-X&nbsp;&nbsp</TT> </TD><TD>list cross reference table
</TR>
</TABLE>

<P>By default, syntactic and semantic error messages and warnings are sent to
the standard output stream (<I>System.out</I>) in a format that several
editors can use to highlight the errors. If the <tt>-M</tt> option is
exercised the error messages are merged with a source listing and sent to the
trace output file <i>listing.txt</i>. It is possible to develop yet other ways
of dealing with error output (see section 3.4).</P>

<H3>3.1 The Parser interface</H3>

<P>A parser generated by this version of Coco/R has the following interface: </P>

<PRE>   class Parser {
     static Token token;              // last recognized token
     static Token t;                  // lookahead token
     static void Parse();             // instigate parse
     static void Error(int n);        // report syntax error n
     static void SemError(int n);     // report semantic error n
     static bool Successful();        // return true if no errors reported
     static string LexString();       // return exact text of parsed token
     static string LexName();         // return text of parsed token (uppercase?)
     static string LookAheadString(); // return exact text of lookahead token
     static string LookAheadName();   // return text of lookahead token (uppercase?)
   }
</PRE>

<P>The parser is started by calling the method <I>Parse()</I>. This is done in
the main method of the compiler (see Section 3.5). <I>Parse</I> and other internal
private methods of the class repeatedly call the scanner to get tokens, and
execute semantic actions at the appropriate places. </P>

<P>The field <I>token</I> holds the most recently parsed token. The field
<I>t</I> is the lookahead token that has already been obtained from the
scanner but not yet parsed. In a semantic action, <I>token</I> refers to the
token immediately before the action commences, and <I>t</I> to the token
immediately visible after the action. </P>

<P>The other methods represent extensions to the original implementation, and
have been provided for convenience, particularly for programmers familiar with
the C, Pascal and Modula versions of Coco/R.  These methods provide hooks into
the parser and error reporters which may be useful in developing semantic
actions within the parser, as well as in developing methods in supporting
classes like table handlers.</P>

Calls to <I>Error(n)</I> are automatically inserted into the generated parser
for the purposes of reporting syntax errors.  The error numbers <I>n</I> are
generated by Coco/R along with appropriate error messages that can be
displayed and related to the position of the <I>lookahead</I> token.  There
may be situations where supporting classes also find it convenient to invoke
this method. However, these are more likely to wish to invoke the
<I>SemError(n)</I> method with an appropriately chosen error number, which
will report a semantic error message related to the position of the most
recently <I>parsed</I> token.  In order to obtain meaningful messages, a
programmer should arrange for these to be incorporated into a method such as
the <i>SemErr</i> method found in the specimen driver classes in the
distribution, which can be installed into the <i>Errors</i> class.

<P>A call to the method <I>Successful()</I> will return <I>true</I> if parsing
has revealed no syntactic or semantic errors up to the time when the call is
made.</P>

<P>The remaining methods return a string that represent a lexeme for one of
the tokens scanned.  <I>LexString()</I> and <I>LexName()</I> return identical
strings unless the scanner has been built with the IGNORE CASE option
selected; in this case <I>LexName</I> returns the string converted to upper
case.</P>


<H3>3.2 The Token interface</H3>

<P>A token obtained from the scanner has the following structure: </P>

<PRE>   class Token {
     int kind;    // token kind
     int pos;     // token position in the source file (starting at 0)
     int line;    // token line (starting at 1)
     int col;     // token column (starting at 1)
     string str;  // token text (exactly as found in the source)
     string val;  // token text (converted to upper case if IGNORE CASE used)
   }
</PRE>

<P>Knowledge of this structure may be of use in the construction of semantic
actions and support methods.  Values of <I>kind</I> are assigned by Coco/R
dynamically.  Values of <I>pos, line</I> and <I>col</I> may be of use in error
reporting.  However, in many applications the inner details of a token may
not be needed, as the parser interface provides methods for dealing with error
reporting.</P>

<H3>3.3 The Scanner interface</H3>

<P>A scanner generated by Coco/R has the following interface: </P>

<PRE>   class Scanner {
     static void Init(String file);                  // constructor
     static Token Scan();                            // scan for next token
   }
</PRE>

<P>The actual scanner is provided by the method <I>Scan()</I> which returns a
<I>Token</I> object every time it is called by the parser (these calls are
generated by Coco/R; users should not call <I>Scan</I> directly). When the
input is exhausted, it returns a special end-of-file token (with the token
code 0) that is known to the parser.</P>

<P>The scanner has to be initialized by calling the method <I>Init</I>,
passing it the name of the source file to be scanned. 

<P>The scanner makes use of a generated <i>Buffer</i> class with the interface

<PRE>   class Buffer {
     static void Fill(string fileName)  // fills scanner internal buffer with contents of file
     static int Read();                 // reads the next source character
     static int Pos;                    // Property gets/sets the reading position
   }
</PRE>

but this is of little direct interest to the programmer.

<H3>3.4 The Errors interface</H3>

<P>Coco/R generates an <I>Errors</I> class that can be used for reporting
errors.  This class has the following interface: </P>

<PRE>
     delegate void ErrorProc(int n, int line, int col);
     delegate void StoreProc(int n, int line, int col, string s);
     delegate void SummaryProc(string s);

     class Errors {
     static int count;                // number of errors detected
     static ErrorProc SynErr;         // syntactic errors
     static ErrorProc SemErr;         // semantic errors
     static StoreProc StoreError;     // store error for future treatment
     static SummaryProc Summarize;    // summarize number of errors reported
     static string fileName;          // the source file name

     static void Exception(string s); // report fatal error and stop
   }
</PRE>

<p>
Coco generates and installs into <i>SynErr</i> a method that stores
appropriate syntax error messages. This method is automatically called by the
parser, in contrast to <i>SemErr</i> which has to be called by the programmer.
By default, <i>SemErr</i> contains a procedure that stores just the line and
colum of a semantic error as well as the error number. Programmers who wish
to generate more readable error messages have to install their own procedure
in <i>SemErr</i> as in the following code fragment:

<pre>   Errors.SemErr = new ErrorProc(MySemErr);

   static void MySemErr(int n, int line, int col) {
     Errors.count++;
     string s;
     switch (n) {
       case 1: s = "Name not found"; break;
       case 2: s = "Name already declared"; break;
       ...
       default: s = "Semantic error " + n; break;
     }
     Errors.StoreError(n, line, col, s);                                        /* PDT 40 */
   }
</pre>

After parsing, the number of detected errors can be obtained from
<i>count</i>.

<p>The default <I>StoreError</I> simply prints the error messages to
<i>System.out</I>.  A user can install a more sophisticated system, as
exemplified in Coco itself, where the $M pragma or -M compile line option
installs a method that merges the messages into a highlighted source listing
as part of the trace output file <i>listing.txt</i>.

<P>The method <I>Exception(s)</I> can be called by the programmer when a
serious error occurs which makes any continuation impossible. After printing
the parameter <I>s</I> the compilation is aborted. </P>

<H3>3.5 The driver program</H3>

<p>
The main class must be provided by the programmer. It has to initialize the
scanner, possibly initialize <i>Errors.SemErr</i>, and call the parser. In its
simplest form it looks as straightforward as this:

<pre>   public class MainProgram {

     public static void Main(string[] arg) {
       Scanner.Init(arg[0]);
       Parser.Parse();
       Console.WriteLine("{0} errors detected", Errors.count);
     }
   }
</pre>


<P>In this version of the system provision has been made for the
generation of a driver program if the $C pragma or -C command line option is
exercised. In its simplest form this driver class is generated from a frame
file <i>Driver.frame</i>.  The generic driver might suffice for simple
applications, but is likely to require some modification, and in particular to

<ul>
<li>Define the error messages associated with any user defined semantic errors
as needed in calls to <i>Parser.SemError</i>

<li>Process command line parameters

<li>Choose how errors are to be summarized/reported

<li>Handle any specific <tt>using</tt> clauses needed


</ul>

<p>
The driver generation system looks for a file <i>Grammar.frame</i> (where
<i>Grammar</i> matches the goal symbol of the parser) to use in
preference to the generic file <i>Driver.frame</i>.  It is suggested that if
users wish to make use of this feature they should use <i>Driver.frame</i> as
a guide towards developing a frame which can be used for their specific
application.


<H3>3.6 Grammar tests</H3>

<P>Coco/R performs several tests to check that the grammar, besides being
syntactically correct, is also semantically well-formed. If one of the
following error messages is produced, no compiler parts are generated. </P>

<BlockQuote>
<DL>
<DT>No production for X </DT>

<DD>The nonterminal X has been used but there is no production for it.
</DD>

<DT>X cannot be reached </DT>

<DD>There is a production for nonterminal X, but X cannot be derived from
the start symbol. </DD>

<DT>X cannot be derived to terminals </DT>

<DD>For example, if there is a production X = &quot;(&quot; X &quot;)&quot; .
</DD>

<DT>X --&gt; Y, Y --&gt; X </DT>

<DD>X and Y are nonterminals with circular derivations. </DD>

<DT>Tokens X and Y cannot be distinguished </DT>

<DD>The terminal symbols X and Y are declared to have the same structure,
e.g.,<BR>
&nbsp;&nbsp;&nbsp;&nbsp;integer = digit {digit} .<BR>
&nbsp;&nbsp;&nbsp;&nbsp;real = digit {digit} [&quot;.&quot; {digit}] .<BR>
In this example, a digit string could be recognized either as an integer
or as a real. </DD>
</DL>
</blockquote>

<P>The following messages are warnings. They may indicate an error but
they may also describe desired effects. The generated compiler parts are
valid, although if an LL(1) error is reported for a construct X one must be
aware that the generated parser will choose the first of several possible
alternatives for X. </P>

<blockquote>
<DL>
<DT>X deletable </DT>

<DD>X can be derived to the empty string, e.g., X = {Y} . </DD>

<DT>LL(1) error in X: Y is the start of more than one alternative </DT>

<DD>Several alternatives in the production of X start with the terminal
Y, e.g.,<BR>
&nbsp;&nbsp;Statement = ident &quot;:=&quot; Expression | ident [ActualParameters] .
</DD>

<DT>LL(1) error in X: Y is the start and successor of a deletable structure
</DT>

<DD>Deletable structures are usually denoted by [...] and {...}, e.g.,<BR>
&nbsp;&nbsp;&nbsp;&nbsp;qualident = [ident &quot;.&quot;] ident .<BR>
&nbsp;&nbsp;&nbsp;&nbsp;Statement = &quot;IF&quot; Expression &quot;THEN&quot; Statement
[&quot;ELSE&quot; Statement] .<BR>
The ELSE at the start of the else-part may also be a successor of a statement.
This LL(1) conflict is known under the name &quot;dangling else&quot;.
</DD>
</DL>
</blockquote>

<H3>3.7 Trace output</H3>

<P>Coco/R can produce various output that can help in spotting LL(1) errors
or in understanding the generated parts. Trace output can be switched on with
command line options, or with the pragma </P>

<PRE>   '$' {digit | letter}
</PRE>

<P>at the beginning of the attributed grammar (for example $FS).  The output
goes to the file <I>listing.txt</I>. The effect of the switches is as follows:
</P>

<TABLE>
<TR>
<TD>
&nbsp;&nbsp;&nbsp;0 or A:<BR>
&nbsp;&nbsp;&nbsp;1 or F:<BR>
&nbsp;&nbsp;&nbsp;2 or G:<BR>
&nbsp;&nbsp;&nbsp;3 or I:<BR>
&nbsp;&nbsp;&nbsp;4 or J:<BR>
&nbsp;&nbsp;&nbsp;5 or T:<BR>
&nbsp;&nbsp;&nbsp;6 or S:<BR>
&nbsp;&nbsp;&nbsp;7 or X:<BR>
&nbsp;&nbsp;&nbsp;8 or P:<BR>
&nbsp;&nbsp;&nbsp;9 or M: </TD>

<TD>prints the states of the scanner automaton<BR>
prints the First and Follow sets of all nonterminals<BR>
prints the syntax graph of the productions<BR>
traces the computation of the First sets<BR>
prints the sets associated with ANY and SYNC<br>
tests the grammar only; generate no scanner or parser<br>
prints the symbol table (terminals, nonterminals, pragmas)<BR>
prints a cross reference list of all syntax symbols<BR>
prints statistics about the Coco/R run<br>
prints error messages merged with a source listing</td> </TR>
</TABLE>

<P><A NAME="hints"></A></P>

<H2>4. Hints for Advanced Users of Coco/R</H2>

<H3>Providing a hand-written scanner</H3>

<P>Scanning is a time-consuming task. The scanner generated by Coco/R is
optimized, but it is implemented as a deterministic finite automaton, which
introduces some overhead. A manual implementation of the scanner can be
slightly more efficient. For time-critical applications a programmer may
wish to generate a parser but provide a hand-written scanner. This can
be done by declaring <I>all</I> terminal symbols (including literals) as tokens,
but without defining their structures by EBNF expressions, e.g., </P>

<PRE>   TOKENS
     ident
     number
     &quot;if&quot;
     ...
</PRE>

<P>If a named token is declared without structure, no scanner is generated.
Tokens are assigned numbers in the order of their declaration; i.e., the
first token gets the number 1, the second the number 2, etc. The number
0 is reserved for the end-of-file symbol. The hand-written scanner has
to return token numbers according to this convention. It must have the
interface described in Section 3.3.</P>

<H3>Tailoring the generated compiler parts to specific needs</H3>

<P>Using a generator usually increases productivity but decreases flexibility.
There are always special cases that can be handled more efficiently in
a hand-written implementation. A good tool handles routine matters in a
standard way, but gives a user the chance to change the standard policy
if this seems appropriate. Coco/R generates the Scanner, Parser and ErrorStream
classes from source texts (so-called frames) stored as the files <I>Scanner.frame</I>
and <I>Parser.frame</I>. It does so by inserting grammar-specific parts
into these frames at the places marked with distinctive --&gt;keys . A user
may edit the frames (within reason!) and thereby change the internally used
algorithms. For example, a user can implement a different buffering scheme for
input characters. </P>

<H3>Accessing the lookahead token</H3>

<P>The generated parser offers the lookahead token <I>t</I> that can be
used to access the next token in the input stream (i.e. the one that has
already been scanned but not yet parsed). The following example shows one
way of computing the start and the end position of a sequence of tokens
enclosed in curly brackets: </P>

<PRE>
   &quot;{&quot;     (. start = t.pos; .)   // start of the first ANY
   {ANY}   (. end = token.pos; .) // start of the last ANY
   &quot;}&quot;
</PRE>

<H3>Controlling the Parser by semantic information</H3>

<P>Ideally, syntax analysis should be &quot;context-free&quot;, that is,
independent of semantic analysis (symbol table handling, type checking, etc.).
However, many languages have constructs that can only be distinguished if one
also considers semantic information (e.g., the type) associated with various
identifiers. In the language Oberon, for example, a <I>Designator</I> is
defined as </P>

<PRE>   Designator = Qualident
                {&quot;.&quot; ident | &quot;^&quot; | &quot;[&quot; ExprList &quot;]&quot; | &quot;(&quot; Qualident &quot;)&quot; } .
</PRE>

<P>where in the context of an <I>Expression</I>, code like <I>x(T)</I> means a
type guard (i.e., <I>x</I> is asserted to be of type <I>T</I>). A
<I>Designator</I> may also be used in a <I>Statement</I> </P>

<PRE>   Statement = ... | Designator [&quot;(&quot; ExprList &quot;)&quot;] | ...  .
</PRE>

<P>but in this context <I>x(T)</I> can be interpreted as a regular procedure
name <I>x</I> followed by a parameter <I>T</I>. The two interpretations of
<I>x(T)</I> can only be distinguished by looking at the type of <I>x</I>. If
<I>x</I> is a regular procedure then the opening bracket is the start of a
parameter list, otherwise the bracket belongs to a type guard. </P>

<P>Coco/R allows control of the parser from within semantic actions to
a certain degree. A <I>Designator</I>, for example, can be processed in the
following way: </P>

<PRE>   Designator &lt;^Item x&gt; =
     Qualident &lt;^x&gt;
     {                        (. if (...x is a procedure...) return; .)
       &quot;(&quot; Qualident &lt;^y&gt; &quot;)&quot; (. ... process type guard ... .)
     | ...
     } .
</PRE>

<P>When an opening bracket is seen after a <I>Qualident</I>, the alternative
starting with an opening bracket is selected. The first semantic action of
this alternative checks for the type of <I>x</I>. If <I>x</I> is a regular
procedure, the parser returns from the production (and presumably continues in
the <I>Statement</I> production whence it was invoked). </P>

<A NAME="Taste"></A>

<H2>5. A Sample Compiler</H2>

<P>This section shows how to use Coco/R for building a compiler for a tiny
programming language called &quot;Taste&quot;. Taste bears some resemblance to
Modula-2 or Oberon. It has variables of type INTEGER and BOOLEAN, and regular
procedures without parameters. It allows assignments, procedure calls, and IF-
and WHILE-statements. Integers may be read from an input stream and written to
an output stream, one to a line. Expressions may incorporate arithmetic
operators (+, -,* , /) and relational operators (=, &lt;, &gt;). </P>

<P>Here is an example of a Taste program, which can be found in the file <A
HREF="../Taste/Test.TAS">Test.TAS</A>: </P>

<PRE>  MODULE Example;
    VAR n: INTEGER;

    PROCEDURE SumUp; (*build the sum of all integers from 1 to n*)
      VAR sum: INTEGER;
    BEGIN
      sum:=0;
      WHILE n&gt;0 DO sum:=sum+n; n:=n-1 END;
      WRITE sum
    END SumUp;

  BEGIN
    READ n;
    WHILE n&gt;0 DO SumUp; READ n END
  END Example.
</PRE>

<P>The full grammar of Taste can be found
<A HREF="Taste.htm">here</A>.
Of course Taste is too restrictive to be used as a real programming language.
Its purpose is just to give one a taste of how to write a compiler with
Coco/R.</P>

<p>
Although the grammar of the Taste language is identical with the Linz release,
some of the internal details have been altered in this release.


<H3> Execution</H3>

<P>The Taste compiler is a compile-and-go compiler, which means that it
reads a source program and translates it into a target program which is
executed (i.e. interpreted) immediately after the compilation. Once it has been
built, the Taste system can be run in "command line mode", for example

<PRE>   Taste Test.TAS
</PRE>

<P>After a succesful compilation the program is interpreted, and the user
is prompted for the source of data and the sink for the output.

<H3>The Target Code</H3>

<P>We define an abstract stack machine as the target for the translation of
Taste programs. The compiler translates a source program into instructions for
this machine, which will later be interpreted. The machine uses the following
data structures (the code array is filled by the compiler): </P>

<PRE>   char code[];   code memory
   int stack[];   stack with frames for local variables
   int top;       stack pointer (points to next free stack element)
   int pc;        program counter
   int base;      base address of current frame
</PRE>

<P>The instructions have variable length and are described by the following
table (the initial values of the registers are: base=0; top=3;): </P>

<PRE>LOAD l,a  Load value          Push(stack[Frame(l)+a]);
LIT i     Load literal        Push(i);
STO l,a   Store               stack[Frame(l)+a]=Pop();
ADD       Add                 j=Pop(); i=Pop(); Push(i+j);
SUB       Subtract            j=Pop(); i=Pop(); Push(i-j);
DIV       Divide              j=Pop(); i=Pop(); Push(i/j);
MUL       Multiply            j=Pop(); i=Pop(); Push(i*j);
EQU       Compare if equal    j=Pop(); i=Pop(); if (i==j) Push(1); else Push(0);
LSS       Compare if less     j=Pop(); i=Pop(); if (i&lt;j) Push(1); else Push(0);
GTR       Compare if greater  j=Pop(); i=Pop(); if (i&gt;j) Push(1); else Push(0);
CALL l,a  Call procedure      slink=Frame(l); Push(slink); Push(base); Push(pc+2);
                              pc=a; base=top-3;
RET       Return from proc.   top=base+3; pc=Pop(); base=Pop(); dummy=Pop();
RES i     Reserve frame space top=top+i;
JMP a     Jump                pc=a;
FJMP a    False jump          if (Pop()==1) pc=adr;
HALT      End of the program  Halt;
NEG       Negation            Push(-Pop());
READ l,a  Read integer        stack[Frame(l)+a]=Read();
WRITE     Write integer       Writeln(Pop());
</PRE>

<P>The function <I>Frame(l)</I> returns the base address of the stack frame which
is (statically) <I>l</I> levels up in the stack. <I>l</I>=0 means the current frame,
<I>l</I>=1 means the statically surrounding frame, etc. The main program is also
considered as a procedure with a stack frame. Figure 2 shows the format
of a stack frame. </P>

<P><IMG SRC="Fig3.gif" ><BR>
<BR>
<B>Fig.2</B> Format of a stack frame </P>

<P>For example, the source code instruction </P>

<PRE>   i := i + 1
</PRE>

<P>is translated into the code </P>

<PRE>   LOAD 0,3  load value of i (at address 3 in current frame)
   LIT 1     load constant 1
   ADD
   STO 0,3   store result to i
</PRE>

<H3>Classes</H3>

<P>Figure 3 shows the primary classes of the compiler (an arrow from A to B
means that class A is used by class B). </P>

<P><IMG SRC="Fig2.gif" ><BR>
<BR>
<B>Fig.3</B> classes of the Taste compiler </P>

<P>The classes in the rectangle are generated by Coco/R, the others are
written by hand. The classes have the following responsibilities: </P>

<TABLE>
<TR>
<TD valign=top><A HREF="../Taste/Taste.cs">Taste</A> </TD>

<TD>Driver class. This obtains the name of the source file, initializes
the scanner and calls
the parser and the interpreter. This file also includes the SemErr method
for reporting meaningful semantic error messages.</TD>
</TR>

<TR>
<TD valign=top><A HREF="../Taste/Parser.cs">Parser</A> </TD>

<TD>Recursive descent parser generated from the attributed grammar
<A HREF="../Taste/Taste.ATG">Taste.ATG</A>.
The parser contains the semantic actions from the attributed grammar. They are
called and executed during parsing and do the actual compilation work.

</TD>
</TR>

<TR>
<TD valign=top><A HREF="../Taste/Scanner.cs">Scanner</A> </TD>

<TD>Scanner generated from <A HREF="../Taste/Taste.ATG">Taste.ATG</A>. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Taste/TL.cs">TL</A> </TD>

<TD>Symbol table with methods to handle scopes and to store and retrieve
semantic information associated with identifiers. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Taste/TC.cs">TC</A> </TD>

<TD>Code generator with methods for emitting instructions and storing these
in the simulated memory for later interpretation. This class also contains
the interpreter method and its related data structures. </TD>
</TR>
</TABLE>

<P>For instructions on where to download all these components together
see <A HREF="#Sources">Section 7.</A>
</P>


<H3>Summary - how to build your own compiler</H3>

<P>If you want to build a similar compiler, but for a source language of your
own, proceed as follows: </P>

<OL>
<LI>Describe the structure of the lexical tokens (identifiers, numbers,
etc) as exemplified in the CHARACTERS and TOKENS sections of the file
<A HREF="../Taste/Taste.ATG">Taste.ATG.</A></LI>

<LI>Describe the phrase structure and the translation process of the compiler in the form of
an attributed grammar, as exemplified in the PRODUCTIONS section of the
file <A HREF="../Taste/Taste.ATG">Taste.ATG.</A></LI>

<LI>Feed the complete ATG file to Coco/R to get the scanner and parser
of the compiler (in the form of C# classes named
Scanner.cs and Parser.cs).</LI>

<LI>Write a driver program with a main method.  This has to initialize the
scanner and call the parser.  Preferably also provide and install an
application-specific SemErr method that will handle semantic error messages in
a meaningful way. These classes are exemplified in the files <A

HREF="../Frames/Driver.frame">Driver.frame</A> and <A
HREF="../Taste/Taste.cs">Taste.cs</A>
</LI>

<LI>Write any other classes needed by the compilation process. These classes
typically contain methods that are called from the semantic actions of the
attributed grammar. For example, the Taste compiler has a symbol table class
(<A HREF="../Taste/TL.cs">TL.cs</A>) and a code generator (<A
HREF="../Taste/TC.cs">TC.cs</A>). </LI>

<LI>Compile the generated parts and the hand-written parts to get a running
compiler. </LI>
</OL>

<A NAME="Apps"></A>
<H2>6. Useful applications of Coco/R</H2>

<P>Coco/R has been used for various tasks (not only for writing compilers).
The following list should give you some ideas about useful applications: </P>

<UL>
<LI>An analyzer for the static complexity of programs. The analyzer evaluates
the kind of operators and statements, the program nesting and the use of
local and global variables to obtain a measure of the program complexity
and an indication of whether the program is well structured. </LI>

<LI>A cross reference generator which lists all occurrences of the objects
in a program according to their scope, together with information where the
objects have been assigned a value and where they have been referenced.
</LI>

<LI>An &quot;intelligent&quot; pretty printer which uses the structure
and the length of statements for proper indentation. </LI>

<LI>A program which generates an index for books and reports. The index
is generated from relations between page numbers and keywords entered in
an appropriate way. </LI>

<LI>The front end of a syntax oriented editor. A program is translated
into a tree representation which is the internal data structure of the
editor. </LI>

<LI>A program that builds a repository of symbols and their relations in
a program. The repository is accessed by a case tool. </LI>
</UL>

<P><A NAME="Sources"></A></P>

<H2>7. Sources of Coco/R</H2>

<p><b><i>We emphasize again that this version of Coco/R for C# is different from
the official Linz release.</i></b>

<P>The sources of this version of Coco/R for C# are as follows: </P>

<TABLE>
<TR>
<TD valign=top><A HREF="../Coco/Coco.ATG">Coco.ATG</A> </TD>

<TD>The attributed grammar. Describes the processing of a compiler description.
</TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/Coco.cs">Coco.cs</A> </TD>

<TD>Driver class generated from Coco.frame. Initializes the scanner and calls
the parser. This file also contains the custom semantic error message handler
SemErr.
</TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/Scanner.cs">Scanner.cs</A> </TD>

<TD>Scanner generated from Coco.ATG. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/Parser.cs">Parser.cs</A> </TD>

<TD>Parser generated from Coco.ATG. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/Tab.cs">Tab.cs</A> </TD>

<TD>Symbol table of Coco. Stores information about terminals and nonterminals.
</TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/DFA.cs">DFA.cs</A> </TD>

<TD>This class builds the scanner automaton and generates the scanner source
file. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/ParserGen.cs">ParserGen.cs</A> </TD>

<TD>This class builds a syntax graph of the grammar rules and generates
the parser source file from it. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Frames/Scanner.frame">Scanner.frame</A> </TD>

<TD>This is the frame file from which the scanner is generated. Coco/R inserts
compiler-specific parts into this frame file. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Frames/Parser.frame">Parser.frame</A> </TD>

<TD>This is the frame file from which the parser is generated. Coco/R inserts
compiler-specific parts into this frame file. </TD>
</TR>

<TR>
<TD valign=top><A HREF="../Coco/Coco.frame">Coco.frame</A> </TD>

<TD>This is the frame file from which the driver class is generated. Coco/R inserts
compiler-specific parts into this frame file. </TD>
</TR>
</TABLE>

<P>If you would prefer to download a ZIP file with the latest .cs source
files and the .frame and .ATG files for Coco/R and Taste, you
can get it from here
<A HREF="ftp://cs.ru.ac.za/pub/coco/ccoco110win.zip"> in Win95 ZIP format</A>,
or from here
<A HREF="ftp://cs.ru.ac.za/pub/coco/ccoco110unix.zip"> in Unix ZIP format</A>,
or from here
<A HREF="ftp://cs.ru.ac.za/pub/coco/ccoco110.tgz"> in Unix .tgz format</A>,
or from here
<A HREF="ftp://cs.ru.ac.za/pub/coco/jcoco110mac.zip"> in Macintosh ZIP format</A>
(these formats differ only in the way in which line marks appear in the source
files).

<p>The ZIP files must be decompressed with a tool that will retain long file
names and directory structures.</P>


<A NAME="port"></A> <H2>8. Portability issues</H2>

<P>Coco/R has been developed so that it can accept input and frame files in
any of the three common text file formats (Unix, MS-DOS, Macintosh); it
generates output in the format of the host system.</P>

<A NAME="Lit"></A> <H2>9. Literature</H2>

<p>
<a href="CocoDataStructures.pdf">CocoDataStructures.pdf</a> provides
a short description of the major data structures of the system. </p>

<P><A HREF="ftp://ftp.ssw.uni-linz.ac.at/pub/Reports/Coco.Report.ps.Z">Coco.Report.ps.Z<BR>
</A>The original report on Coco/R, including an implementation description.
</P>

<P><A HREF="ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Coco.Paper.ps.Z">Coco.Paper.ps.Z<BR>
</A>H.M&ouml;ssenb&ouml;ck: A Generator for Production Quality Compilers.
Lecture Notes in Computer Science 477, Springer-Verlag, 1990. </P>

<P><A HREF="http://cs.ru.ac.za/homes/cspt/compbook.htm">Compiler book using
Coco/R<BR>
</A>P.D.Terry: <I>Compilers and Compiler Generators - An Introduction With
C++</I>. International Thomson Computer Press, 1997<BR>
(This book uses the C++ version of Coco/R, and also supplies case studies for
the Modula-2 and Pascal versions.) </P>

<P>The book is, unfortunately, now out of print. International
Thomson decided to drop their Computer Press division, and this title was
one of those to be dropped after copies ran out. The copyright was returned to
the author, who has now created an online version at
<A HREF="http://www.scifac.ru.ac.za/compilers">
http://www.scifac.ru.ac.za/compilers.</A>
This site also contains compressed downloadable files of the online edition,
and files from which a paper copy may be produced on an HP Laserjet compatible
printer.

</BODY>
</HTML>

